ABSTRACT

The implementation of a vast majority of machine learning
(ML) algorithms boils down to solving a numerical optimization
problem. In this context, Stochastic Gradient Descent
(SGD) methods have long proven to provide good results,
both in terms of convergence and accuracy. Recently, several
parallelization approaches have been proposed in order
to scale SGD to solve very large ML problems. At their
core, most of these approaches follow a MapReduce
scheme.

This paper presents a novel parallel updating algorithm for
SGD, which utilizes the asynchronous single-sided communication
paradigm. Compared to existing methods, Asynchronous
Parallel Stochastic Gradient Descent (ASGD) provides
faster convergence, at linear scalability and stable accuracy.

1. INTRODUCTION 

The enduring success of Big Data applications, which typically
include the mining, analysis and inference of very
large datasets, is leading to a change in paradigm for machine
learning research objectives [3]. With plenty of data at
hand, the traditional challenge of inferring generalizing models
from small sets of available training samples moves out of
focus. Instead, the availability of resources like CPU time,
memory size or network bandwidth has become the dominating
limiting factor for large scale machine learning algorithms.

In this context, algorithms which guarantee useful results
even in the case of an early termination are of special interest.
With limited (CPU) time, fast and stable convergence
is of high practical value, especially when the computation
can be stopped at any time and continued some time later
when more resources are available.

Parallelization of machine learning (ML) methods has been
a rising topic for some time (refer to for a comprehensive
overview). However, until the introduction of the MapReduce
pattern, research was mainly focused on shared memory
systems. This changed with the presentation of a generic
MapReduce strategy for ML algorithms, which showed
that most of the existing ML techniques could easily be
transformed to fit the MapReduce scheme.

Figure 1: Evaluation of the scaling properties of different
parallel gradient descent algorithms for machine
learning applications on distributed memory
systems. Results show a K-Means clustering with
k=10 on a 10-dimensional target space, represented
by ~1TB of training samples. Our novel ASGD
method is not only the fastest algorithm in this
test, it also shows better than linear scaling performance,
outperforming the SGD parallelization
by [20] and the MapReduce based BATCH optimization,
which both suffer from communication
overheads.

After a short period of rather enthusiastic porting of algorithms
to this framework, concerns grew as to whether following
the MapReduce ansatz truly provides a solid solution for
large scale ML. It turns out that MapReduce's easy parallelization
comes at the cost of poor scalability. The main
reason for this undesired behavior resides deep down within
the numerical properties most machine learning algorithms
have in common: an optimization problem. In this context,
MapReduce works very well for the implementation of
so-called batch-solver approaches, which were also used in the
MapReduce framework of [^. However, batch-solvers have
to run over the entire dataset to compute a single iteration
step. Hence, their scalability with respect to the data size
is poor.






Even long before parallelization had become a topic, most
ML implementations avoided the known drawbacks of
batch-solvers by using alternative online optimization methods.
Most notably, Stochastic Gradient Descent (SGD) methods
have long proven to provide good results for ML optimization
problems. However, due to its inherently sequential
nature, SGD is hard to parallelize and even harder to scale
[16], especially when communication latencies are causing
dependency locks, which is typical for parallelization tasks
on distributed memory systems [20].

The aim of this paper is to propose a novel, lock-free parallelization
method for the computation of stochastic gradient
optimization for large scale machine learning algorithms on
cluster environments.

1.1 Related Work 

Recently, several approaches [20], [17], [16], [13] towards an
effective parallelization of the SGD optimization have been
proposed. A detailed overview and in-depth analysis of their
application to machine learning can be found in [20].

In this section, we focus on a brief discussion of four related 
publications, which provided the essentials for our approach: 


• A theoretical framework for the analysis of SGD parallelization
performance has been presented in [20]. The
same paper also introduced a novel approach (called
SimuParallelSGD), which avoids communication and
any locking mechanisms up to a single and final MapReduce
step. To the best of our knowledge, SimuParallelSGD
is currently the best performing algorithm
concerning cluster based parallelized SGD. A detailed
discussion of this method is given in section 2.3.

• In [17], a so-called mini-BATCH update scheme has
been introduced. It was shown that replacing the strict
online updating mechanism of SGD with small accumulations
of gradient steps can significantly improve
the convergence speed and robustness (also see section
2.4).

• A widely noticed approach for a “lock-free” parallelization
of SGD on shared memory systems has been introduced
in [16]. The basic idea of this method is
to explicitly ignore potential data races and to write
updates directly into the memory of other processes.
Given a minimum level of sparsity, they were able
to show that possible data races will harm neither
the convergence nor the accuracy of a parallel SGD.
Even more, without any locking overhead, [16] sets
the current performance standard for shared memory
systems.

• A distributed version of [16] has been presented in [13],
showing cross-host CPU to CPU and GPU to GPU
gradient updates over Ethernet connections.


In [^, the concept of a Partitioned Global Address
Space programming framework (called GASPI) has been
introduced. This provides an asynchronous, single-sided
communication and parallelization scheme for
cluster environments (further details in section 3.1).
We build our asynchronous communication on the basis
of this framework.


1.2 Asynchronous SGD 

The basic idea of our proposed method is to port the “lock-free”
shared memory approach from [16] to distributed memory
systems. This is far from trivial, mostly because communication
latencies in such systems will inevitably cause
expensive dependency locks if the communication is performed
in common two-sided protocols (such as MPI message
passing or MapReduce). This is also the motivation for
SimuParallelSGD [20] to avoid communication during the
optimization: locking costs are usually much higher than
the information gain induced by the communication.

We overcome this dilemma by applying the asynchronous,
single-sided communication model provided by [^:
individual processes send mini-BATCH updates completely
uninformed of the recipient's status whenever they
are ready to do so. On the recipient side, updates
are included in the local computation as they become available. In this
scheme, no process ever waits for any communication to be
sent or received. Hence, communication is literally “free” (in
terms of latency).

Of course, such a communication scheme will cause data
races and race conditions: updates might be (partially) overwritten
before they are used, or might even be counterproductive
because the sender state is far behind the state of
the recipient.

We resolve these problems by two strategies: first, we obey
the sparsity requirements introduced by [16]. This can be
achieved by sending only partial updates to a few random
recipients. Second, we introduce a Parzen-window function,
selecting only those updates for local descent which are
likely to improve the local state. Figure 2 gives a schematic
overview of the ASGD algorithm's asynchronous communication
scheme.

The remainder of this paper is organized as
follows: first, we briefly review gradient descent methods in
section 2 and discuss further aspects of the previously mentioned
related approaches in more detail. Section 3 gives a
quick overview of the asynchronous communication concept
and its implementation. The actual details of the ASGD
algorithm are introduced in section 4, followed by a theoretical
analysis and an extensive experimental evaluation in
section 5.

2. GRADIENT DESCENT OPTIMIZATION 

From a strongly simplified perspective, machine learning
tasks are usually about the inference of generalized models
from a given dataset $X = \{x_0, \dots, x_m\}$ with $x_i \in \mathbb{R}^n$,
which in case of supervised learning is also assigned with
semantic labels $Y = \{y_0, \dots, y_m\}$.

During the learning process, the quality of a model is evaluated
by use of so-called loss-functions, which measure how
well the current model represents the given data. We write
$x_j(w)$ or $(x_j, y_j)(w)$ to indicate the loss of a data point for
the current parameter set $w$ of the model function. We will
also refer to $w$ as the state of the model. The actual learning
is then the process of minimizing the loss over all samples.
This is usually implemented via a gradient descent over the
partial derivative of the loss function in the parameter space
of $w$.


2.1 Batch Optimization 

The numerically easiest way to solve most gradient descent
optimization problems is the so-called batch optimization.
A state $w_t$ at time $t$ is updated by the mean gradient generated
by ALL samples of the available dataset. Algorithm 1
gives an overview of the BATCH optimization scheme.
A MapReduce parallelization for many BATCH optimized
machine learning algorithms has been introduced by [^.

Algorithm 1 BATCH optimization with samples $X = \{x_0, \dots, x_m\}$,
iterations $T$, step size $\epsilon$ and states $w$

1: for all $t = 0 \dots T$ do
2:   init $w_{t+1} = 0$
3:   update $w_{t+1} = w_t - \epsilon \sum_{x_j \in X} \partial_w x_j(w_t)$
4:   $w_{t+1} = w_{t+1} / |X|$

Figure 2: Overview of the asynchronous update communication used in ASGD. Given a cluster environment
of R nodes with H threads each, the blue markers indicate different stages and scenarios of the communication
mode. I: Thread 3 of node 1 finished the computation of its local mini-batch update. The external buffer
is empty. Hence it executes the update locally and sends the resulting state to a few random recipients. II:
Thread 1 of node 2 receives an update. When its local mini-batch update is ready, it will use the external
buffer to correct its local update and then follow I. III: Shows a potential data race: two external updates
might overlap in the external buffer of thread H-1 of node 2. Resolving data races is discussed in section 4.4.

2.2 Stochastic Gradient Descent 

In order to overcome the drawbacks of full batch optimization,
many online updating methods have been proposed.
One of the most prominent is SGD. Although some properties
of Stochastic Gradient Descent approaches might prevent
their successful application to some optimization domains,
they are well established in the machine learning
community [^. Following the notation in [^, SGD can be
formalized in pseudo code as outlined in algorithm 2.

Algorithm 2 SGD with samples $X = \{x_0, \dots, x_m\}$, iterations
$T$, step size $\epsilon$ and states $w$

Require: $\epsilon > 0$
1: for all $t = 0 \dots T$ do
2:   draw $j \in \{1 \dots m\}$ uniformly at random
3:   update $w_{t+1} \leftarrow w_t - \epsilon \partial_w x_j(w_t)$
4: return $w_t$

The advantage in terms of computational cost with respect to
the number of data samples is evident: compared to batch
updates of quadratic complexity, SGD updates come at linearly
growing iteration costs. At least for ML-applications,
SGD error rates even outperform batch algorithms in many
cases [^.
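The difference between the two schemes can be illustrated with a minimal sketch. It assumes a hypothetical one-dimensional loss $x_j(w) = \frac{1}{2}(w - x_j)^2$ (not taken from the paper), whose minimizer is the sample mean; the function names `grad`, `batch_descent` and `sgd` are illustrative only. The batch variant touches every sample per step, whereas SGD touches one randomly drawn sample.

```python
import random

def grad(w, x):
    # Hypothetical loss x_j(w) = 0.5 * (w - x_j)**2, so d_w x_j(w) = w - x_j.
    return w - x

def batch_descent(samples, w0, eps, T):
    """Batch scheme: every step uses the mean gradient over ALL samples."""
    w = w0
    for _ in range(T):
        mean_grad = sum(grad(w, x) for x in samples) / len(samples)
        w = w - eps * mean_grad
    return w

def sgd(samples, w0, eps, T, rng):
    """SGD scheme: every step uses one uniformly drawn sample."""
    w = w0
    for _ in range(T):
        x = rng.choice(samples)        # draw j uniformly at random
        w = w - eps * grad(w, x)       # w_{t+1} <- w_t - eps * d_w x_j(w_t)
    return w

rng = random.Random(0)
samples = [rng.gauss(5.0, 1.0) for _ in range(1000)]
mean = sum(samples) / len(samples)
w_batch = batch_descent(samples, 0.0, 0.1, 200)   # 200 * 1000 gradient evaluations
w_sgd = sgd(samples, 0.0, 0.01, 5000, rng)        # 5000 gradient evaluations
```

Both runs end close to the sample mean, but the SGD run needs far fewer gradient evaluations and keeps fluctuating around the optimum with a fixed step size.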

Since the actual update step in line 3 of algorithm 2 plays
a crucial role in deriving our approach, we simplify the
notation in this step and denote the partial derivative of the
loss-function for the remainder of this paper in terms of an
update step $\Delta$:

$$\Delta_j(w_t) := \partial_w x_j(w_t). \qquad (1)$$

2.3 Parallel SGD 

The current “state of the art” approach towards a parallel
SGD algorithm for shared memory systems has been presented
in [20]. The main objective in their ansatz is to avoid
communication between working threads, thus preventing
dependency locks. After a coordinated initialization step,
all workers operate independently until convergence (or early
termination). Their theoretical analysis unveiled the
surprising fact that a single aggregation of the distributed
results after termination is sufficient in order to guarantee
good convergence and error rates.

Given a learning rate (i.e. step size) $\epsilon$ and the number of
threads $n$, this formalizes as shown in algorithm 3.

Algorithm 3 SimuParallelSGD with samples $X = \{x_0, \dots, x_m\}$,
iterations $T$, step size $\epsilon$, number of threads $n$
and states $w$

Require: $\epsilon > 0$, $n > 1$
1: define $H = \lfloor m/n \rfloor$
2: randomly partition $X$, giving $H$ samples to each node
3: for all $i \in \{1, \dots, n\}$ parallel do
4:   randomly shuffle samples on node $i$
5:   init $w^i_0 = 0$
6:   for all $t = 0 \dots T$ do
7:     get the $t$-th sample on the $i$-th node and compute
8:     update $w^i_{t+1} \leftarrow w^i_t - \epsilon \Delta_t(w^i_t)$
9: aggregate $v = \frac{1}{n} \sum_{i=1}^{n} w^i_t$
10: return $v$
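The SimuParallelSGD scheme can be sketched in a few lines. This is a sequential simulation under stated assumptions: the workers are run one after another instead of in parallel, the loss is the hypothetical $\frac{1}{2}(w - x)^2$ used for illustration only, and the name `simu_parallel_sgd` is not from the paper.

```python
import random

def simu_parallel_sgd(samples, eps, T, n, rng):
    """Sequential stand-in for n independent workers plus one final average."""
    data = samples[:]
    rng.shuffle(data)                      # random partition of X
    H = len(data) // n                     # H samples per worker
    parts = [data[i * H:(i + 1) * H] for i in range(n)]
    states = []
    for part in parts:                     # each iteration would be one worker
        rng.shuffle(part)
        w = 0.0                            # init w_0^i = 0
        for t in range(min(T, H)):
            # gradient of the assumed loss 0.5 * (w - x)**2 is (w - x)
            w = w - eps * (w - part[t])
        states.append(w)
    return sum(states) / n                 # single aggregation step

rng = random.Random(1)
samples = [rng.gauss(5.0, 1.0) for _ in range(1000)]
v = simu_parallel_sgd(samples, eps=0.05, T=250, n=4, rng=rng)
```

No worker ever reads another worker's state before the final average, which is what makes the scheme free of communication locks.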


However, adapting sequential algorithms to fit such a pattern
is far from trivial. This is mostly because of the inevitable
loss of information on the current state of the sender
and receiver.


2.4 Mini-Batch SGD 

The mini-batch modification introduced by [17] tries to unite
the advantages of online SGD with the stability of BATCH
methods. It follows the SGD scheme, but instead of updating
after each single data sample, it aggregates several
samples into a small batch. This mini-batch is then used to
perform the online update. It can be implemented as shown
in algorithm 4.


Algorithm 4 Mini-Batch SGD with samples $X = \{x_0, \dots, x_m\}$,
iterations $T$, step size $\epsilon$, number of threads $n$
and mini-batch size $b$

Require: $\epsilon > 0$
1: for all $t = 0 \dots T$ do
2:   draw mini-batch $M \leftarrow b$ samples from $X$
3:   init $\Delta w_t = 0$
4:   for all $x \in M$ do
5:     aggregate update $\Delta w_t \leftarrow \partial_w x_j(w_t)$
6:   update $w_{t+1} \leftarrow w_t - \epsilon \Delta w_t$
7: return $w_t$
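A minimal sketch of the mini-batch scheme follows. It assumes the hypothetical loss $\frac{1}{2}(w - x)^2$ and averages the aggregated gradients over the mini-batch, which is an assumption made here so that the step size does not have to be rescaled with $b$; the name `minibatch_sgd` is illustrative.

```python
import random

def minibatch_sgd(samples, w0, eps, T, b, rng):
    """Mini-batch SGD: aggregate b sample gradients, then do one update."""
    w = w0
    for _ in range(T):
        batch = rng.sample(samples, b)     # draw mini-batch M of b samples
        delta = 0.0                        # init aggregated step
        for x in batch:
            delta += w - x                 # gradient of the assumed loss
        delta /= b                         # averaging is an assumption of this sketch
        w = w - eps * delta
    return w

rng = random.Random(2)
samples = [rng.gauss(5.0, 1.0) for _ in range(1000)]
w_mb = minibatch_sgd(samples, 0.0, 0.1, 300, 20, rng)
```

Compared to single-sample SGD with the same step size, the averaged mini-batch gradient has a lower variance, so the final state fluctuates less around the optimum.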


3. ASYNCHRONOUS COMMUNICATION 

Figure 3 shows the basic principle of the asynchronous communication
model compared to the more commonly applied
two-sided synchronous message passing scheme. An overview
of the properties, theoretical implications and pitfalls of parallelization
by asynchronous communication can be found
in [^. For the scope of this paper, we rely on the fact
that single-sided communication can be used to design lock-free
parallel algorithms. This can be achieved by design
patterns propagating an early communication of data into
work-queues of remote processes, keeping these busy at all
times. If successful, communication virtually becomes “free”
in terms of latency.

Figure 3: Single-sided asynchronous communication
model (right) compared to a typical synchronous
model (left). The red areas indicate dependency
locks of the processes p1, p2, waiting for data or acknowledgements.
The asynchronous model is lock-free,
but comes at the price that processes never
know if and when and in what order messages reach
the receiver. Hence, a process can only be informed
about past states of a remote computation, never
about the current status.


3.1 GASPI Specification 

The Global Address Space Programming Interface (GASPI)
1^ uses one-sided RDMA driven communication with remote
completion to provide a scalable, flexible and failure tolerant
parallelization framework. GASPI favors an asynchronous
communication model over two-sided bulk communication
schemes. The open-source implementation GPI 2.0^1 provides
a C++ interface of the GASPI specification.
Benchmarks for various applications^2 show that the GASPI
communication schemes can outperform MPI based implementations
1^ for many applications.


4. THE ASGD ALGORITHM 

The concept of the ASGD algorithm, as described in section
1.2 and figure 2, is formalized and implemented on the basis
of the SGD parallelization presented in [20]. In fact, the
asynchronous communication is just added to the existing
approach. This is based on the assumption that communication
(if performed correctly) can only improve the gradient
descent - especially when it is “free”. If the communication
interval is set to infinity, ASGD will become SimuParallelSGD.

Parameters 

ASGD takes several parameters which can have a strong
influence on the convergence speed and quality (see experiments
for details on the impact):

$T$ defines the size of the data partition for each thread, $\epsilon$
sets the gradient step size (which needs to be fixed following
the theoretic constraints shown in |^), $b$ sets the size of
the mini-batch aggregation, and $I$ gives the number of SGD
iterations for each thread. Practically, this also equals the
number of data points touched by each thread.

Initialization 

The initialization step is straightforward and analogous to
SimuParallelSGD [20]: the data is split into working packages
of size $T$ and distributed to the worker threads. A control
thread generates initial, problem dependent values for
$w_0$ and communicates $w_0$ to all workers. From that point on,
all workers run independently, following the asynchronous
communication scheme shown in figure 2.

It should be noted that $w_0$ could also be initialized with
the preliminary results of a previous, early terminated optimization
run.

Updating 

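The online gradient descent update step is the key leverage
point of the ASGD algorithm. The local state $w^i_t$ of thread
$i$ at iteration $t$ is updated by an externally modified step
$\overline{\Delta_t}(w^i_{t+1})$, which not only depends on the local $\Delta_t(w^i_{t+1})$ but
also on a possible communicated state $w^j_{t'}$ from an unknown
iteration $t'$ at some random thread $j$:

$$\overline{\Delta_t}(w^i_{t+1}) = w^i_t - \frac{1}{2}\left(w^i_t + w^j_{t'}\right) + \Delta_t(w^i_{t+1}) \qquad (2)$$

For the usage of $N$ external buffers per thread, we generalize
equation 2 to:

$$\overline{\Delta_t}(w^i_{t+1}) = w^i_t - \frac{1}{|N|+1}\left(\sum_{n=1}^{N} w^n_{t'} + w^i_t\right) + \Delta_t(w^i_{t+1}),$$

$$\text{where } |N| := \sum_{n=0}^{N} \lambda(w^n_{t'}), \quad \lambda(w^n_{t'}) = \begin{cases} 1 & \text{if } \|w^n_{t'}\|_2 > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

with $N$ incoming messages. Figure 4 gives a schematic
overview of the update process.

^1 Download available at http://www.gpi-site.com/gpi2/
^2 Further benchmarks available at http://www.gpi-site.com/gpi2/benchmarks/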


4.1 Parzen-Window Optimization 

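As discussed in 1.2 and shown in figure 2, the asynchronous
communication scheme is prone to cause data races and
other conditions during the update. Hence, we introduce
a Parzen-window like function $\delta(i, j)$ to avoid “bad” update
conditions. The data races are discussed in section 4.4.

$$\delta(i, j) := \begin{cases} 1 & \text{if } \|(w^i_t - \epsilon \Delta w^i_t) - w^j_{t'}\|^2 < \|w^i_t - w^j_{t'}\|^2 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

We consider an update to be “bad” if the external state $w^j_{t'}$
would direct the update away from the projected solution,
rather than towards it. Figure 4 shows the evaluation of
$\delta(i, j)$, which is then plugged into the update functions of
ASGD in order to exclude undesirable external states from
the computation. Hence, equation 2 turns into

$$\overline{\Delta_t}(w^i_{t+1}) = \left[w^i_t - \frac{1}{2}\left(w^i_t + w^j_{t'}\right)\right] \delta(i, j) + \Delta_t(w^i_{t+1}) \qquad (5)$$

and equation 3 to

$$\overline{\Delta_t}(w^i_{t+1}) = w^i_t - \frac{1}{\sum_{n=1}^{N} \delta(i, n) + 1}\left(\sum_{n=1}^{N} \delta(i, n)\, w^n_{t'} + w^i_t\right) + \Delta_t(w^i_{t+1}) \qquad (6)$$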
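The masked update can be sketched as follows. The acceptance test used here is an assumption: an external state is accepted only if it lies closer to the projected local state (after the local step) than to the current local state, which matches the verbal description of a “good” update. States are plain Python lists, and the names `parzen_accept` and `external_step` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two states."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def parzen_accept(w_local, step, w_ext, eps):
    """Assumed window: 1 if w_ext is closer to the projected state, else 0."""
    projected = [w - eps * s for w, s in zip(w_local, step)]
    return 1 if sq_dist(projected, w_ext) < sq_dist(w_local, w_ext) else 0

def external_step(w_local, step, buffers, eps):
    """Modified step over N buffers; empty buffers are represented by None."""
    accepted = [w for w in buffers
                if w is not None and parzen_accept(w_local, step, w, eps)]
    norm = len(accepted) + 1               # accepted external states + local state
    result = []
    for d in range(len(w_local)):
        mean_d = (sum(w[d] for w in accepted) + w_local[d]) / norm
        result.append(w_local[d] - mean_d + step[d])
    return result

# Local state 0 with gradient step -1: descent moves towards positive values.
modified = external_step([0.0], [-1.0], [[1.0], [-1.0], None], eps=0.1)
w_next = [0.0 - 0.1 * modified[0]]
```

In this example only the external state at 1.0 passes the window, so the modified step pulls the local state towards it in addition to the local gradient step; with no accepted messages the function falls back to the plain local step.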


Computational Costs 

Obviously, the evaluation of $\delta(i, j)$ comes at some computational
cost. Since $\delta(i, j)$ has to be evaluated for each received
message, the “free” communication is actually not so
free after all. However, the costs are very low and can be
reduced to the computation of the distance between two
states, which can be achieved linearly in the dimensionality
of the parameter-space of $w$ and the mini-batch size:
$O(\frac{1}{b}|w|)$. Experiments in section 5 show that the impact of
the communication costs is negligible.

In practice, the communication frequency $\frac{1}{b}$ is mostly constrained
by the network bandwidth between the compute
nodes, which is briefly discussed in section 4.5.


4.2 Mini-Batch Extension 

We further alter the update of our ASGD by extending it
with the mini-batch approach introduced in section 2.4. The
motivation for this is twofold: first, we would like to benefit
from the advantages of mini-batch updates shown in [17].
Second, the sparse nature of the asynchronous communication
forces us to accumulate updates anyway. Otherwise, the
external states could only affect single SGD iteration steps.
Because the communication frequency is practically bound
by the node interconnection bandwidth, the size of the mini-batch
$b$ is used to control the impact of external states.

We write $\Delta_M$ in order to differentiate mini-batch steps from
single sample steps $\Delta_t$ of sample $x_t$:

$$\overline{\Delta_M}(w^i_{t+1}) = \left[w^i_t - \frac{1}{2}\left(w^i_t + w^j_{t'}\right)\right] \delta(i, j) + \Delta_M(w^i_{t+1}) \qquad (7)$$

Note that the step size $\epsilon$ is not independent of $b$ and should
be adjusted accordingly.


4.3 The final ASGD Update Algorithm 

Reassembling our extension into SGD, we obtain the final
ASGD algorithm. With mini-batch size $b$, number of iterations
$T$ and learning rate $\epsilon$, the update can be implemented
as shown in algorithm 5.

Algorithm 5 ASGD ($X = \{x_0, \dots, x_m\}$, $T$, $\epsilon$, $w_0$, $b$)

Require: $\epsilon > 0$, $n > 1$
1: define $H = \lfloor m/n \rfloor$
2: randomly partition $X$, giving $H$ samples to each node
3: for all $i \in \{1, \dots, n\}$ parallel do
4:   randomly shuffle samples on node $i$
5:   init $w^i_0 = 0$
6:   for all $t = 0 \dots T$ do
7:     draw mini-batch $M \leftarrow b$ samples from $X$
8:     update $w^i_{t+1} \leftarrow w^i_t - \epsilon \overline{\Delta_M}(w^i_{t+1})$
9:     send $w^i_{t+1}$ to random node $\neq i$
10: return $w^1_I$

At termination, all nodes $w^i$, $i \in \{1, \dots, n\}$ hold
small local variations of the global result. As shown in algorithm 5,
one can simply return one of these local models
(namely $w^1_I$) as global result. Alternatively, we could also
aggregate the $w^i_I$ via MapReduce. Experiments in section
5.5 show that in most cases the first variant is sufficient and
faster.
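The interplay of local mini-batch steps, message passing and masking can be illustrated with a small single-process simulation. This is a sketch under several assumptions, not the GPI-based implementation: the workers are visited round-robin instead of running in parallel, each worker has one external buffer, the loss is the hypothetical $\frac{1}{2}(w - x)^2$, and the acceptance test is the assumed distance-based window (external state closer to the projected state than to the current one).

```python
import random

def asgd_simulation(samples, n, T, b, eps, rng):
    """Round-robin stand-in for n asynchronous workers with one mailbox each."""
    data = samples[:]
    rng.shuffle(data)
    H = len(data) // n
    parts = [data[i * H:(i + 1) * H] for i in range(n)]
    w = [0.0] * n
    mailbox = [None] * n                   # external buffer of each worker
    for _ in range(T):
        for i in range(n):
            batch = rng.sample(parts[i], b)
            step = sum(w[i] - x for x in batch) / b    # local mini-batch step
            ext, mailbox[i] = mailbox[i], None         # read and clear the buffer
            if ext is not None:
                projected = w[i] - eps * step
                if (projected - ext) ** 2 < (w[i] - ext) ** 2:
                    step += w[i] - 0.5 * (w[i] + ext)  # accepted external state
            w[i] -= eps * step
            j = rng.choice([k for k in range(n) if k != i])
            mailbox[j] = w[i]              # may overwrite an unread message
    return w

rng = random.Random(3)
samples = [rng.gauss(5.0, 1.0) for _ in range(1000)]
states = asgd_simulation(samples, n=4, T=200, b=10, eps=0.1, rng=rng)
```

No worker ever waits for a message: an empty mailbox simply means a purely local step, and an overwritten mailbox corresponds to a lost update.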


4.4 Data races and sparsity 

Potential data races during the asynchronous external update
come in two forms: First, the complete loss of
an update state $w^j$ because it has been completely overwritten
by a second state $w^k$. Since ASGD communication
is de facto optional, a lost message might slow down the convergence
by a margin, but is completely harmless otherwise.
The second case is more complicated: a partially overwritten
message, i.e. $w^i$ reads an update from $w^j$ while this is
overwritten by the update from $w^k$.

We address this data race issue based on the findings in
[16]. There, it has been shown that the error which is induced
by such data races during an SGD update is linearly
bound in the number of conflicting variables and tends to
underestimate the gradient projection. [16] also showed that
for sparse problems, where the probability of conflicts is reduced,
data race errors are negligible. For non-sparse problems,
[16] showed that sparsity can be induced by partial
updating. We apply this approach to ASGD updates, leaving
the choice of the partitioning to the application, e.g. for
K-Means we partition along the individual cluster centers of
the states. Additionally, the asynchronous communication
model causes further sparsity in time, as processes read external
updates with shifted delays. This further decreases
the probability of races.

Figure 4: ASGD updating. This figure visualizes the update algorithm of a process with state $w^i_t$, its
local mini-batch update $\Delta_M(w^i_{t+1})$ and received external state $w^j_{t'}$ for a simplified 1-dimensional optimization
problem. The dotted lines indicate a projection of the expected descent path to an (local) optimum. I: Initial
setting: $\Delta_M(w^i_{t+1})$ is computed and $w^j_{t'}$ is in the external buffer. II: Parzen-window masking of $w^j_{t'}$. Only if
the condition of equation 4 is met, $w^j_{t'}$ will contribute to the local update. III: Computing $\overline{\Delta_M}(w^i_{t+1})$. IV:
Updating $w^i_{t+1} \leftarrow w^i_t - \epsilon \overline{\Delta_M}(w^i_{t+1})$.

4.5 Communication load balancing 

We previously discussed that the choice of the communication
frequency $\frac{1}{b}$ has a significant impact on the convergence
speed. Theoretically, more communication should be beneficial.
However, due to the limited bandwidth, the practical
limit is expected to be far from $b = 1$.

The choice of an optimal $b$ strongly depends on the data
(in terms of dimensionality) and the computing environment:
interconnection bandwidth, number of nodes, CPUs
per node, NUMA layout and so on. Hence, $b$ is a parameter
which needs to be determined experimentally.

For most of the experiments shown in section 5, we found
$500 < b < 2000$ to be quite stable.

5. EXPERIMENTS 

We evaluate the performance of our proposed method in
terms of convergence speed, scalability and error rates of
the learning objective function using the K-Means Clustering
algorithm. The motivation to choose this algorithm for
evaluation is twofold: First, K-Means is probably one of the
simplest machine learning algorithms known in the literature
(refer to for a comprehensive overview). This leaves little
room for algorithmic optimization other than the choice of
the numerical optimization method. Second, it is also one
of the most popular^3 unsupervised learning algorithms with
a wide range of applications and a large practical impact.

^3 The original paper has been cited several thousand
times.

5.1 K-Means Clustering

K-Means is an unsupervised learning algorithm, which tries
to find the underlying cluster structure of an unlabeled vectorized
dataset. Given a set of $m$ $n$-dimensional points
$X = \{x_i\}$, $i = 1, \dots, m$, which is to be clustered into a set of
$k$ clusters, $w = \{w_k\}$, $k = 1, \dots, k$, the K-Means algorithm
finds a partition such that the squared error between the
empirical mean of a cluster and the points in the cluster is
minimized.

It should be noted that finding the global minimum of the
squared error over all $k$ clusters $E(w)$ is proven to be an
NP-HARD problem [^. Hence, all optimization methods investigated
in this paper are only approximations of an optimal
solution. However, it has been shown [^ that K-Means
finds local optima which are very likely to be in close proximity
to the global minimum if the assumed structure of $k$
clusters is actually present in the given data.

Gradient Descent Optimization

Following the notation given in [^, K-Means is formalized
as minimization problem of the quantization error $E(w)$:

$$E(w) = \sum_i \frac{1}{2}\left(x_i - w_{s_i(w)}\right)^2, \qquad (8)$$

where $w = \{w_k\}$ is the target set of $k$ prototypes for given
$m$ examples $\{x_i\}$ and $s_i(w)$ returns the index of the closest
prototype to the sample $x_i$. The gradient descent of the
quantization error $E(w)$ is then derived as $\Delta(w) = \frac{\partial E(w)}{\partial w}$.
For the usage with the previously defined gradient descent
algorithms, this can be reformulated to the following update
functions with step size $\epsilon$. Algorithms 1 and 5 use a batch
update scheme, where the size $m' = m$ for the original
BATCH algorithm and $m' \ll m$ for our ASGD:

$$\Delta(w_k) = \frac{1}{m'} \sum_i \begin{cases} x_i - w_k & \text{if } k = s_i(w) \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

The SGD (algorithm 3) uses an online update:

$$\Delta(w_k) = \begin{cases} x_i - w_k & \text{if } k = s_i(w) \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$
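The online K-Means update can be sketched in a few lines. This is an illustrative sketch with hypothetical function names: the prototype closest to a sample is moved towards that sample by a fraction $\epsilon$ of the difference $x_i - w_k$, and all other prototypes stay untouched. The quantization error is evaluated as half the summed squared distance of each sample to its closest prototype.

```python
def closest(prototypes, x):
    """s_i(w): index of the prototype closest to sample x."""
    return min(range(len(prototypes)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, prototypes[k])))

def kmeans_sgd_step(prototypes, x, eps):
    """Online update: only the closest prototype moves towards x."""
    k = closest(prototypes, x)
    prototypes[k] = [w + eps * (xi - w) for w, xi in zip(prototypes[k], x)]
    return k

def quantization_error(prototypes, samples):
    """E(w): half the squared distance of every sample to its closest prototype."""
    total = 0.0
    for x in samples:
        k = closest(prototypes, x)
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(x, prototypes[k]))
    return total

prototypes = [[0.0, 0.0], [10.0, 10.0]]
samples = [[1.0, 1.0]]
error_before = quantization_error(prototypes, samples)
moved = kmeans_sgd_step(prototypes, samples[0], eps=0.5)
error_after = quantization_error(prototypes, samples)
```

A mini-batch variant would accumulate the differences per prototype over $m'$ samples and apply the averaged step once.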


Implementation 

We applied all three previously introduced gradient descent
methods to K-Means clustering: the batch optimization with
MapReduce (algorithm 1), the parallel SGD (algorithm 3)
and our proposed ASGD (see algorithm 5) method.
We used the C++ interface of GPI 2.0 and the C++11
standard library threads for local parallelization.

To assure a fair comparison, all methods share the same
data IO and distribution methods, as well as an optimized
MapReduce method, which uses a tree structured communication
model to avoid transmission bottlenecks.

5.2 Cluster Setup 

The experiments were conducted on a Linux cluster with a
BeeGFS^4 parallel file system. Each compute node is equipped
with dual Intel Xeon E5-2670, totaling 16 CPUs per node,
32 GB RAM and interconnected with EDR Infiniband.

If not noted otherwise, we used a standard of 64 nodes
to compute the experimental results (which totals 1024
CPUs).

5.3 Data 

We use two different types of datasets for the experimental 
evaluation and comparison of the three investigated algo¬ 
rithms: a synthetically generated collection of datasets and 
data from an image classification application. 

Synthetic Data Sets 

The need to use synthetic datasets for evaluation arises from 
several rather profound reasons: (I) the optimal solution is 
usually unknown for real data, (II) only a few very large 
datasets are publicly available, and, (III) we even need a 
collection of datasets with varying parameters such as di¬ 
mensionality n, size m and number of clusters k in order to 
evaluate the scalability. 

The generation of the data follows a simple heuristic: given 
n, m and k we randomly sample k cluster centers and then 
randomly draw m samples. Each sample is randomly drawn 
from a distribution which is uniquely generated for the in¬ 
dividual centers. Possible cluster overlaps are controlled 
by additional minimum cluster distance and cluster vari¬ 
ance parameters. The detailed properties of the datasets 
are given in the context of the experiments. 
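The generation heuristic can be sketched as follows. This is a minimal sketch under stated assumptions: cluster centers are drawn uniformly from a box, each cluster uses an isotropic Gaussian with a common standard deviation `spread`, and the overlap control is reduced to a minimum center distance `min_dist`. All parameter names are hypothetical and the actual distributions used for the experiments are not specified here.

```python
import random

def make_clusters(n, m, k, spread, rng, min_dist=0.0, box=10.0):
    """Draw k centers in [-box, box]^n, then m samples around random centers."""
    centers = []
    while len(centers) < k:                # note: loops forever if min_dist is too large
        c = [rng.uniform(-box, box) for _ in range(n)]
        far_enough = all(
            sum((a - b) ** 2 for a, b in zip(c, other)) >= min_dist ** 2
            for other in centers)
        if far_enough:
            centers.append(c)
    samples = []
    for _ in range(m):
        center = rng.choice(centers)       # pick the generating cluster
        samples.append([rng.gauss(mu, spread) for mu in center])
    return centers, samples

rng = random.Random(4)
centers, samples = make_clusters(n=2, m=500, k=3, spread=0.5, rng=rng, min_dist=1.0)
```

Keeping the generating centers makes it possible to compare them later with the centers returned by a clustering run.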

Image Classification 

Image classification is a common task in the field of computer
vision. Roughly speaking, the goal is to automatically
detect and classify the content of images into a given set of
categories like persons, cars, airplanes, bikes, furniture and
so on. A common approach is to extract low level image features
and then to generate a “Codebook” of universal image
parts, the so-called Bag of Features [^. Objects are then
described as statistical model of these parts. The key step
towards the generation of the “Codebook” is a clustering of
the image feature space.

In our case, large numbers of d = 128 dimensional HOG
features were extracted from a collection of images and
clustered to form “Codebooks” with k = 100,..., 1000 entries.


5.4 Evaluation 

Due to the non-deterministic nature of stochastic methods 
and the fact that the investigated K-Means algorithms might 

^see www.beegfs.com for details 


get stuck in local minima, we apply a 10-fold evaluation of 
all experiments. If not noted otherwise, plots show the mean 
results. Since the variance is usually magnitudes lower than 
the plotted scale, we neglect the display of variance bars in 
the plots for the sake of readability. If needed, we report sig¬ 
nificant differences in the variance statistics separately. To 
simply the notation, we will denote the SimuParallelSGD 
algorithm by SGD, the MapReduce baseline method 
by BATGH and our algorithm by ASGD. Eor better com¬ 
parability, we give the number of iterations I as global sum 
over all samples that have been touched by the respective 
algorithm. Hence, Ibatch := T • |A|, Isgd := T ■ \CPUs\ 
and I ASGD :=T -b-lCPUsl. 

Given runtimes are computed for optimization only, neglecting
the initial data transfer to the nodes, which is the same
for all methods. Errors reported for the synthetic datasets
are computed as follows: We use the “ground-truth” cluster
centers from the data generation step to measure their
distance to the centers returned by the investigated
algorithms. Evidently, this measure has no absolute meaning.
It is only useful for comparing the relative differences
in the convergence of the algorithms. Also, it cannot be
expected that a method will reach a zero error result,
simply because there is no absolute truth for overlapping
clusters that can be obtained from the generation process
without actually solving the exact NP-hard clustering
problem. Hence, the “ground-truth” is most likely also
biased in some way.
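One possible form of such a relative error measure is sketched below. The text does not specify how ground-truth centers are matched to returned centers, so the nearest-center matching and the averaging used here are assumptions made for illustration, and the function name is hypothetical.

```python
import math


def center_error(truth, found):
    """Mean Euclidean distance from each ground-truth center to the
    nearest returned center (assumed matching rule, see lead-in)."""
    total = 0.0
    for t in truth:
        total += min(math.dist(t, f) for f in found)
    return total / len(truth)


# Made-up 2-dimensional example with two clusters.
truth = [(0.0, 0.0), (10.0, 0.0)]
found = [(10.0, 1.0), (0.0, 0.0)]
print(center_error(truth, found))
```

As stated above, a value computed in this way is only meaningful for comparing methods against each other, not as an absolute quality score.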


5.5 Experimental Results 

Scaling 

We evaluate the runtime and scaling properties of our
proposed algorithm in a series of experiments on synthetic and
real datasets (see section 5). First, we test a strong scaling
scenario, where the size of the input data (in k, d and number
of samples) and the global number of iterations are constant
for each experiment, while the number of CPUs is increased.
Independent of the number of iterations and CPUs, ASGD
is always the fastest method, both for synthetic (see figure 5)
and real data (figure 6). Notably, it shows (slightly)
better than linear scaling properties. The SGD and BATCH
methods suffer from a communication overhead which makes
them scale clearly worse than linearly (linear scaling is
projected by the dotted lines in the graphs). For SGD, this
effect is dominant for smaller numbers of iterations^ and
softens proportionally with the increasing number of
iterations. This is due to the fact that the communication
cost is independent of the number of iterations.
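The last argument can be illustrated with a toy runtime model. This is a sketch under strong assumptions, not a model fitted to the measurements: compute time is taken to parallelize perfectly, the communication cost is a single constant, and all names and numbers are hypothetical.

```python
def sgd_runtime(iterations, cpus, t_sample, t_comm):
    """Toy model: perfectly parallel compute plus a fixed communication cost."""
    compute = iterations * t_sample / cpus
    return compute + t_comm


def communication_share(iterations, cpus, t_sample, t_comm):
    """Fraction of the total runtime that is spent on communication."""
    return t_comm / sgd_runtime(iterations, cpus, t_sample, t_comm)


# Hypothetical constants: 0.1 ms per sample, 1 s communication, 100 CPUs.
for iterations in (10**6, 10**8, 10**10):
    share = communication_share(iterations, cpus=100, t_sample=1e-4, t_comm=1.0)
    print(f"I = {iterations:.0e}: communication share = {share:.4f}")
```

Because the communication term does not grow with the number of iterations, its share of the runtime shrinks as the iteration count increases, which matches the softening effect described above.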

The second experiment investigates scaling in the number of
target clusters k, given constant I, d, number of CPUs and
data size. Figure 7 shows that all methods scale better than
O(log k). While ASGD is faster than the other methods, its
scaling properties are slightly worse. This is due to the fact
that the necessary sparseness of the asynchronous updates
(see section 4.4) is increasing with k.


^Note: as shown in figure]^ a smaller number of iterations 
is actually sufficient to solve the given problem. 

Figure 5: Results of a strong scaling experiment on
the synthetic dataset with k=10, d=10 and ~1TB
data samples for different numbers of iterations I.
The related error rates are shown in figure 9.



Figure 6: Strong scaling of real data. Results
with I = 10^10 and k = 10, ..., 1000 on the image
classification dataset.

Figure 7: Scaling the number of clusters k on real
data. Results for the same experiment as in figure 6.
Note: here, the dotted lines project a logarithmic
scaling of the runtime in the number of clusters.

Convergence Speed 

Convergence (in terms of iterations and time) is an important
factor in large scale machine learning, where the early
termination properties of algorithms have a huge practical
impact. Figure 8 shows the superior convergence properties
of ASGD. While it finally converges to similar error rates, it
reaches a fixed error rate with fewer iterations than SGD or
BATCH. As shown in figure this early convergence property
can result in speedups of up to one order of magnitude.
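The notion of reaching a fixed error rate with fewer iterations can be made concrete with a small helper. The sketch below is illustrative only: the function name is hypothetical, and the two error traces are invented numbers, not values read from the experiments.

```python
def iterations_to_reach(trace, target):
    """Return the first iteration count at which the error is <= target.

    trace is a list of (iterations, error) pairs in increasing
    iteration order; None is returned if the target is never reached.
    """
    for iterations, error in trace:
        if error <= target:
            return iterations
    return None


# Illustrative traces only, not measurements from the experiments.
asgd_trace = [(10**6, 0.9), (10**7, 0.3), (10**8, 0.2)]
sgd_trace = [(10**6, 1.5), (10**7, 0.8), (10**8, 0.3)]

target = 0.3
speedup = iterations_to_reach(sgd_trace, target) / iterations_to_reach(asgd_trace, target)
print(speedup)
```

Comparing the iteration counts (or, analogously, the wall-clock times) at a common target error is one way to quantify an early convergence advantage between two methods.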



Figure 8: Convergence speed of different gradient
descent methods used to solve K-Means clustering
with k = 100 and b = 500 on a 10-dimensional target
space parallelized over 1024 CPUs on a cluster.
Our novel ASGD method outperforms communication
free SGD [20] and MapReduce based BATCH [5]
optimization by orders of magnitude.

Optimization Error after Convergence

The optimization error after full convergence^ for the strong
scaling experiment (see section 5.5) is shown in figures 9
and 10. While ASGD outperforms BATCH, it shows no significant
difference in the mean error rates compared to SGD.
However, figure 10 shows that it tends to be more stable
in terms of the variance of the non-deterministic K-Means
results.
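The mean and variance statistics over repeated runs can be computed as sketched below. The helper name and the ten error values are hypothetical placeholders, and the use of the sample variance (rather than the population variance) is an assumption.

```python
import statistics


def summarize_runs(errors):
    """Return (mean, sample variance) of the final errors of repeated runs."""
    return statistics.mean(errors), statistics.variance(errors)


# Placeholder values for a 10-fold evaluation, not measured results.
final_errors = [0.31, 0.29, 0.30, 0.32, 0.30, 0.28, 0.31, 0.30, 0.29, 0.30]
mean_error, error_variance = summarize_runs(final_errors)
print(mean_error, error_variance)
```

Two methods with nearly identical mean errors can still differ in this variance, which is the sense in which one method is called more stable than the other.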


Test Errors and Variance 
Figure 9: Error rates and their variance of the
strong scaling experiment on synthetic data shown
in figure 5. A more detailed view of the variances is
shown in figure 10.



Figure 10: Variance of the error rates of the strong
scaling experiment on synthetic data shown in figure 5.

Communication Frequency 

Theoretically, more communication should lead to better
results of the ASGD algorithm, as long as the node
interconnection provides enough bandwidth. Figure 11 shows this
effect: as long as the bandwidth suffices for the
communication load, the overhead of an ASGD update is marginal
compared to the SGD baseline. However, the overhead
increases to over 30% when the bandwidth is exceeded.

^Full convergence is here defined as the state where the error
rate is not improving after several iterations.



Figure 11: Communication cost of ASGD. The cost
of higher communication frequencies 1/b in ASGD
updates compared to communication free SGD
updates.
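The overhead figures quoted for this comparison can be read as a relative cost per update. A minimal sketch of that calculation follows; the function name and the two timings are hypothetical and only serve to show the arithmetic.

```python
def relative_overhead(t_asgd, t_sgd):
    """Overhead of an ASGD update relative to a communication free
    SGD update, in percent."""
    return 100.0 * (t_asgd - t_sgd) / t_sgd


# Hypothetical timings per update (arbitrary units).
print(relative_overhead(t_asgd=1.02, t_sgd=1.0))  # bandwidth sufficient
print(relative_overhead(t_asgd=1.35, t_sgd=1.0))  # bandwidth exceeded
```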

As indicated by the data in figure 11, we chose b = 500 for
all of our experiments. However, as noted in section 4.5, an
optimal choice for b has to be found for each hardware
configuration separately. The number of messages exchanged
during the strong scaling experiments is shown in figure 12.
While the number of messages sent by each CPU stays close
to constant, the number of received messages is decreasing.
Notably, the impact on the asynchronous communication
remains stable, because the number of “good” messages is
also close to constant. Figure 13 shows the impact of the
communication frequencies 1/b on the convergence properties
of ASGD. If the frequency is set to lower values, the
convergence moves towards the original SimuParallelSGD
behavior.

Figure 12: Asynchronous communication rates during
the strong scaling experiment (see figure 5). This
figure shows the average number of messages sent or
received by a single CPU over all iterations. “Good”
messages are defined as those which were selected
by the Parzen-window function, contributing to the
local update.



Figure 13: Convergence speed of ASGD for two different
communication frequencies in relation to the other
methods. Results on synthetic data with d = 10, k = 100.



Figure 14: Convergence speed of ASGD optimization
(synthetic dataset, k = 10, d = 10) with and without
asynchronous communication (silent).

Impact of the Asynchronous Communication

ASGD differs in two major aspects from SimuParallelSGD:
asynchronous communication and mini-batch updates. In
order to verify that the single-sided communication is the
dominating factor of ASGD’s properties, we simply turned
off the communication (silent mode) during the optimization.
Figures 14 and 15 show that our communication
model is indeed driving the early convergence feature, both
in terms of iterations and time needed to reach a given error
level.



Figure 15: Early convergence properties of ASGD 
without communication (silent) compared to ASGD 
and SGD. 


Final Aggregation 

As noted in section 4, the local results of ASGD could be
further aggregated by a final reduce step (just like in SGD).
Figures 16 and 17 show a comparison of both approaches on
the strong scaling experiment (see figure 5).
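A final reduce step of this kind could look like the sketch below. Several assumptions are made here for illustration: the reduce is taken to be an element-wise arithmetic mean over the per-node results, each local result is a list of k centers with d coordinates, and cluster indices are assumed to be aligned across nodes. The function name is hypothetical.

```python
def average_states(local_states):
    """Element-wise mean of per-node results.

    Each state is a list of k centers, each center a list of d floats.
    Assumes that center c on one node corresponds to center c on all
    other nodes.
    """
    n = len(local_states)
    k = len(local_states[0])
    d = len(local_states[0][0])
    return [
        [sum(state[c][j] for state in local_states) / n for j in range(d)]
        for c in range(k)
    ]


# Two nodes, one center, two dimensions (made-up values).
print(average_states([[[0.0, 0.0]], [[2.0, 4.0]]]))
```

The alternative without aggregation simply keeps the local result of a single node, which saves the reduce step at the end of the optimization.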


Strong Scaling 



Figure 16: Comparison of the runtime and scalability
for the two possible final aggregation methods of
ASGD. Synthetic dataset, k = 10, d = 10, and ~1TB
data samples. Error rates for this experiment are
shown in figure 17.


Figure 17: Error rates for the same experiment
shown in figure 16. Comparing the final aggregation
steps of ASGD.


6. CONCLUSIONS

We presented a novel approach towards an effective
parallelization of stochastic gradient descent optimization on
distributed memory systems. Our experiments show that the
asynchronous communication scheme can be applied
successfully to SGD optimizations of machine learning
algorithms, providing superior scalability and convergence
compared to previous methods.

In particular, the early convergence property of ASGD should
be of high practical value to many applications in large scale
machine learning.